Shared Memory Fences
In our last adventure, dri3k first steps, one of the future work
items was to deal with synchronization between the direct rendering
application and the X server. DRI2 handles this by performing a
round trip each time the application starts using a buffer that was
being used by the X server.
As DRI3 manages buffer allocation within the application, there's really no reason to talk to the server, so this implicit serialization point just isn't available to us. As I mentioned last time, James Jones and Aaron Plattner added an explicit GPU serialization system to the Sync extension. These SyncFences serialize rendering between two X clients, and within the server there are hooks provided for the driver to use hardware-specific serialization primitives.
The existing Linux DRM interfaces queue rendering to the GPU in the order requests are made to the kernel, so we don't need the ability to serialize within the GPU; we just need to serialize requests to the kernel. Simple CPU-based serialization gating access to the GPU will suffice here, at least for the current set of drivers. GPU access which is not mediated by the kernel will presumably require serialization that involves the GPU itself. We'll leave that for a future adventure though; the goal today is to build something that works with the current Linux DRM interfaces.
SyncFence Semantics
The semantics required by SyncFences are for multiple clients to block on a fence which a single client then triggers. All of the blocked clients start executing requests immediately after the trigger fires. There are four basic operations on SyncFences (sketched as C prototypes after this list):
- Trigger. Mark the fence as ready and wake up all waiting clients
- Await. Block until the fence is ready.
- Query. Retrieve the current state of the fence.
- Reset. Unset the fence; future Await requests will block.
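In C terms, the rest of this post builds those four operations as functions over a single shared 32-bit word; the prototypes below just summarize the code that follows:

int  fence_await(int32_t *f);    /* Await: block until triggered */
int  fence_trigger(int32_t *f);  /* Trigger: mark ready, wake all waiters */
int  fence_query(int32_t *f);    /* Query: 1 if triggered, 0 otherwise */
void fence_reset(int32_t *f);    /* Reset: back to untriggered */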
The implementation is built on Linux futexes; the first piece is a thin wrapper around the raw system call (this and the rest of the code assume <stdint.h>, <errno.h>, <limits.h>, <time.h>, <unistd.h>, <sys/syscall.h> and <linux/futex.h> are included):

static inline long sys_futex(void *addr1, int op, int val1,
                             struct timespec *timeout, void *addr2, int val3)
{
    return syscall(SYS_futex, addr1, op, val1, timeout, addr2, val3);
}
For this little exercise, I created two simple wrappers, one to block
on a futex:
static inline int futex_wait(int32_t *addr, int32_t value)
{
    return sys_futex(addr, FUTEX_WAIT, value, NULL, NULL, 0);
}
and one to wake up all futex waiters:
static inline int futex_wake(int32_t *addr)
{
    return sys_futex(addr, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}
Atomic Memory Operations
I need atomic memory operations to keep separate cores from seeing different values of the fence, and GCC defines a few such primitives; I picked __sync_bool_compare_and_swap and __sync_val_compare_and_swap. I also need fetch and store operations that the compiler won't shuffle around:
#define barrier() __asm__ __volatile__("": : :"memory")

static inline void atomic_store(int32_t *f, int32_t v)
{
    barrier();
    *f = v;
    barrier();
}

static inline int32_t atomic_fetch(int32_t *a)
{
    int32_t v;

    barrier();
    v = *a;
    barrier();
    return v;
}
If your machine doesn't make these two operations atomic, then you would redefine them as needed.
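For instance (a sketch of my own, assuming a GCC or Clang toolchain that provides the __atomic builtins; the _seq names are mine), the same two helpers could be written with explicitly sequentially consistent accesses:

static inline void atomic_store_seq(int32_t *f, int32_t v)
{
    /* Sequentially consistent store; at least as strong as the
       barrier()-wrapped plain store above. */
    __atomic_store_n(f, v, __ATOMIC_SEQ_CST);
}

static inline int32_t atomic_fetch_seq(int32_t *a)
{
    /* Sequentially consistent load. */
    return __atomic_load_n(a, __ATOMIC_SEQ_CST);
}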
Futex-based Fences
The wake-all semantics of Fences greatly simplify reasoning about the operation: there's no need to ensure that only a single thread runs past Await; the only requirement is that no thread passes the Await operation until the fence is triggered.
A Fence is defined by a single 32-bit integer which can take one of
three values:
- 0 - The fence is not triggered, and there are no waiters.
- 1 - The fence is triggered (there can be no waiters at this point).
- -1 - The fence is not triggered, and there are waiters (one or more).
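The transitions between these values, summarized here as a comment (my restatement of the code below), are:

/*  0 -> -1   Await registers itself as a waiter before sleeping
 *  0 ->  1   Trigger with no waiters
 * -1 ->  1   Trigger with waiters, followed by a futex wake
 *  1 ->  0   Reset
 */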
Here's the Await function:

int fence_await(int32_t *f)
{
    while (__sync_val_compare_and_swap(f, 0, -1) != 1)
        if (futex_wait(f, -1))
            if (errno != EWOULDBLOCK)
                return -1;
    return 0;
}
The basic requirement that the thread not run until the fence is
triggered is met by fetching the current value of the fence and
comparing it with 1. Until it is signaled, that comparison will return
false.
The compare_and_swap operation makes sure the fence is -1 before the thread calls futex_wait: either it was already -1, in the case where there were other waiters, or it was 0 before and is now -1, in the case where there were no waiters. This needs to be an atomic operation so that the fence value will be seen as -1 by the trigger operation if there are any threads in the syscall.

The futex_wait call will return once the value is no longer -1; it also ensures that the thread won't block if the trigger occurs between the swap and the syscall.
Here's the Trigger function:

int fence_trigger(int32_t *f)
{
    if (__sync_val_compare_and_swap(f, 0, 1) == -1) {
        atomic_store(f, 1);
        if (futex_wake(f) < 0)
            return -1;
    }
    return 0;
}
The atomic compare_and_swap operation will make sure that no Await thread swaps the 0 for a -1 while the trigger is changing the value from 0 to 1; either the Await switches from 0 to -1 or the Trigger switches from 0 to 1.

If the value before the compare_and_swap was -1, then there may be threads waiting on the Fence. An atomic store, constructed with two memory barriers and a regular store operation, to mark the Fence triggered is followed by the futex_wake call to unblock all Awaiting threads.
The Query function is just an atomic fetch:
int fence_query(int32_t *f)
{
    return atomic_fetch(f) == 1;
}
Reset requires a compare_and_swap so that it doesn't disturb things if the fence has already been reset and there are threads waiting on it:

void fence_reset(int32_t *f)
{
    __sync_bool_compare_and_swap(f, 1, 0);
}
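To show how the pieces fit together, here is a usage sketch of my own (not from the original post; error checking elided): the fence word lives in an anonymous MAP_SHARED mapping so that a forked child and its parent both see it, the child Awaits, and the parent Triggers.

#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>

int main(void)
{
    /* One shared 32-bit fence word, visible to parent and child. */
    int32_t *fence = mmap(NULL, sizeof(int32_t), PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    *fence = 0;                   /* untriggered, no waiters */

    if (fork() == 0) {
        fence_await(fence);       /* blocks until the parent triggers */
        printf("child: fence triggered\n");
        _exit(0);
    }

    sleep(1);                     /* give the child time to reach Await */
    fence_trigger(fence);         /* wakes the waiting child */
    wait(NULL);
    return 0;
}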
A Request for Review
Ok, so we've all tried to create synchronization primitives only to find that our obvious implementations were full of holes. I'd love to hear from you if you've identified any problems in the above code, or if you can figure out how to use the existing glibc primitives for this operation.
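On the last question, one direction I can imagine (purely a sketch of my own, not from the post, and not audited for the same races): glibc's pthread mutex and condition variable can be marked PTHREAD_PROCESS_SHARED and placed in the shared mapping, trading the single 32-bit word for a larger structure:

#include <pthread.h>
#include <stdint.h>

/* Hypothetical fence built from process-shared pthread primitives;
   the struct must live in memory mapped by every participant. */
struct shm_fence {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int32_t         triggered;
};

static int shm_fence_init(struct shm_fence *sf)
{
    pthread_mutexattr_t ma;
    pthread_condattr_t  ca;

    pthread_mutexattr_init(&ma);
    pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
    pthread_condattr_init(&ca);
    pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);
    sf->triggered = 0;
    if (pthread_mutex_init(&sf->lock, &ma) ||
        pthread_cond_init(&sf->cond, &ca))
        return -1;
    return 0;
}

static void shm_fence_trigger(struct shm_fence *sf)
{
    pthread_mutex_lock(&sf->lock);
    sf->triggered = 1;
    pthread_cond_broadcast(&sf->cond);    /* wake-all, like futex_wake */
    pthread_mutex_unlock(&sf->lock);
}

static void shm_fence_await(struct shm_fence *sf)
{
    pthread_mutex_lock(&sf->lock);
    while (!sf->triggered)                /* Query and Reset would just
                                             read or clear 'triggered'
                                             under the same lock */
        pthread_cond_wait(&sf->cond, &sf->lock);
    pthread_mutex_unlock(&sf->lock);
}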